Abstract

We introduce UCF101, which is currently the largest
dataset of human actions. It consists of 101 action classes,
over 13k clips and 27 hours of video data. The database
consists of realistic user-uploaded videos containing
camera motion and cluttered backgrounds. Additionally, we
provide baseline action recognition results on this new
dataset using a standard bag-of-words approach, with an
overall accuracy of 44.5%. To the best of our knowledge,
UCF101 is currently the most challenging dataset of actions
due to its large number of classes, its large number of clips
and the unconstrained nature of those clips.

1. Introduction 

The majority of existing action recognition datasets suffer
from two disadvantages: 1) The number of their classes is
typically very low compared to the richness of actions
performed by humans in reality, e.g. the KTH [ ], Weizmann [ ],
UCF Sports [ ] and IXMAS [ ] datasets include only 6, 9, 9 and
11 classes respectively. 2) The videos are recorded in
unrealistically controlled environments. For instance, KTH,
Weizmann and IXMAS are staged by actors; HOHA [ ] and
UCF Sports are composed of movie clips captured by
professional filming crews. Recently, web videos have been
used in order to utilize unconstrained user-uploaded data and
alleviate the second issue [6, 8, 9, 5]. However, the first
disadvantage remains unresolved, as the largest existing
dataset does not include more than 51 actions, while several
works have shown that the number of classes plays a crucial
role in evaluating an action recognition method [ , ].
Therefore, we have compiled a new dataset with 101 actions and
13320 clips, which is nearly twice as large as the largest
existing dataset in terms of number of actions and clips.
(HMDB51 [ ] and UCF50 [ ] are currently the largest ones, with
6766 clips of 51 actions and 6681 clips of 50 actions
respectively.)

The dataset is composed of web videos which are
recorded in unconstrained environments and typically
include camera motion, various lighting conditions, partial
occlusion, low quality frames, etc. Fig. 1 shows sample
frames of 6 action classes from UCF101.

Figure 1. Sample frames for 6 action classes of UCF101.

2. Dataset Details 

Action Classes: UCF101 includes a total of 101 action
classes, which we have divided into five types:
Human-Object Interaction, Body-Motion Only,
Human-Human Interaction, Playing Musical Instruments, Sports.

UCF101 is an extension of UCF50, which included the
following 50 action classes: {Baseball Pitch, Basketball
Shooting, Bench Press, Biking, Billiards Shot, Breaststroke,
Clean and Jerk, Diving, Drumming, Fencing, Golf Swing,
High Jump, Horse Race, Horse Riding, Hula Hoop, Javelin
Throw, Juggling Balls, Jumping Jack, Jump Rope, Kayaking,
Lunges, Military Parade, Mixing Batter, Nun chucks,
Pizza Tossing, Playing Guitar, Playing Piano, Playing
Tabla, Playing Violin, Pole Vault, Pommel Horse, Pull Ups,
Punch, Push Ups, Rock Climbing Indoor, Rope Climbing,
Rowing, Salsa Spins, Skate Boarding, Skiing, Skijet,
Soccer Juggling, Swing, TaiChi, Tennis Swing, Throw Discus,
Trampoline Jumping, Volleyball Spiking, Walking with a
dog, Yo Yo}. The color of each class label specifies which
predefined action type it belongs to.

Figure 2. 101 actions included in UCF101 shown with one sample frame. The color of frame borders specifies to which action type they
belong: Human-Object Interaction, Body-Motion Only, Human-Human Interaction, Playing Musical Instruments, Sports.

The following 51 new classes are introduced in UCF101:
{Apply Eye Makeup, Apply Lipstick, Archery, Baby Crawling,
Balance Beam, Band Marching, Basketball Dunk, Blow
Drying Hair, Blowing Candles, Body Weight Squats, Bowling,
Boxing-Punching Bag, Boxing-Speed Bag, Brushing
Teeth, Cliff Diving, Cricket Bowling, Cricket Shot,
Cutting In Kitchen, Field Hockey Penalty, Floor Gymnastics,
Frisbee Catch, Front Crawl, Haircut, Hammering,
Hammer Throw, Handstand Pushups, Handstand Walking, Head
Massage, Ice Dancing, Knitting, Long Jump, Mopping
Floor, Parallel Bars, Playing Cello, Playing Daf, Playing
Dhol, Playing Flute, Playing Sitar, Rafting, Shaving Beard,
Shot put, Sky Diving, Soccer Penalty, Still Rings, Sumo
Wrestling, Surfing, Table Tennis Shot, Typing, Uneven Bars,
Wall Pushups, Writing On Board}. Fig. 2 shows a sample
frame for each action class of UCF101.

Figure 3. Number of clips per action class. The distribution of clip durations is illustrated by the colors.

Clip Groups: The clips of one action class are divided
into 25 groups, each containing 4-7 clips. The clips in
one group share some common features, such as the
background or actors.

The bar chart of Fig. 3 shows the number of clips in 
each class. The colors on each bar illustrate the durations 
of different clips included in that class. The chart shown in 
Fig. 4 illustrates the average clip length (green) and total 
duration of clips (blue) for each action class. 

The videos are downloaded from YouTube [ ] and the
irrelevant ones are manually removed. All clips have a fixed
frame rate of 25 FPS and a fixed resolution of 320 x 240.
The videos are saved as .avi files compressed using the
DivX codec available in the k-lite package [ ]. The audio
is preserved for the clips of the 51 new actions. Table 1
summarizes the characteristics of the dataset.



Actions              101
Clips                13320
Groups per Action    25
Clips per Group      4-7
Mean Clip Length     7.21 sec
Total Duration       1600 mins
Min Clip Length      1.06 sec
Max Clip Length      71.04 sec
Frame Rate           25 fps
Resolution           320x240
Audio                Yes (51 actions)

Table 1. Summary of Characteristics of UCF101
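
As a quick consistency check on Table 1, the clip count and mean clip length should roughly reproduce the reported total duration, and the clip count should fall within the bounds implied by 25 groups of 4-7 clips per action. The sketch below uses only figures from the table; the variable names are illustrative, and the mean clip length is a rounded value, so the result is approximate.

```python
# Figures copied from Table 1; variable names are illustrative only.
num_actions = 101
num_clips = 13320
groups_per_action = 25
min_clips_per_group, max_clips_per_group = 4, 7
mean_clip_length_sec = 7.21  # rounded in the table, so totals are approximate

# Total duration implied by the mean clip length.
total_minutes = num_clips * mean_clip_length_sec / 60
total_hours = total_minutes / 60

# Bounds on the clip count implied by the group structure.
min_possible_clips = num_actions * groups_per_action * min_clips_per_group
max_possible_clips = num_actions * groups_per_action * max_clips_per_group

print(f"about {total_minutes:.0f} minutes ({total_hours:.1f} hours)")
print(f"clip count must lie in [{min_possible_clips}, {max_possible_clips}]")
```

The implied total is close to the 1600 minutes in the table and to the 27 hours quoted in the abstract, and 13320 clips lies inside the range allowed by the group structure.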



Figure 4. Total time of videos for each class is illustrated using the blue bars. The average length of the clips for each action is depicted in
green.

Naming Convention: The zipped file of the dataset
(available at http://crcv.ucf.edu/data/UCF101.php) includes
101 folders, each containing the clips of one action class.
The name of each clip has the following form:

v_X_gY_cZ.avi

where X, Y and Z represent the action class label,
group and clip number respectively. For instance,
v_ApplyEyeMakeup_g03_c04.avi corresponds to clip 4 of
group 3 of action class ApplyEyeMakeup.
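
The naming convention can be unpacked mechanically. The following is a minimal sketch, not part of the dataset release: the `ClipName` type and `parse_clip_name` function are hypothetical names, and the pattern assumes that the action class label itself contains no underscore.

```python
import re
from typing import NamedTuple


class ClipName(NamedTuple):
    """Hypothetical container for the three fields of a clip name."""
    action: str
    group: int
    clip: int


# v_X_gY_cZ.avi, assuming the action label X contains no underscore.
_CLIP_RE = re.compile(r"^v_(?P<action>[^_]+)_g(?P<group>\d+)_c(?P<clip>\d+)\.avi$")


def parse_clip_name(filename: str) -> ClipName:
    """Split a clip file name into action label, group number and clip number."""
    match = _CLIP_RE.match(filename)
    if match is None:
        raise ValueError(f"not a v_X_gY_cZ.avi clip name: {filename!r}")
    return ClipName(match["action"], int(match["group"]), int(match["clip"]))


print(parse_clip_name("v_ApplyEyeMakeup_g03_c04.avi"))
# ClipName(action='ApplyEyeMakeup', group=3, clip=4)
```

Keeping the group number as an integer makes it straightforward to hold out whole groups when building cross-validation folds.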

3. Experimental Results 

We performed an experiment using the bag-of-words
approach, which is widely accepted as a standard action
recognition method, to provide baseline results on UCF101.

From each clip, we extracted Harris3D corners (using
the implementation by [7]) and computed 162-dimensional
HOG/HOF descriptors for each. We clustered a randomly
selected set of 100,000 space-time interest points (STIP)
using k-means to build the codebook. The size of our
codebook is k=4000, which has been shown to yield good
results over a wide range of datasets. The descriptors were
assigned to their closest video words using a nearest
neighbor classifier, and each clip was represented by a
4000-dimensional histogram of its words. Utilizing a
leave-one-group-out 25-fold cross validation scenario, an
SVM was trained using the histogram vectors of the training
folds. We employed a nonlinear multiclass SVM with a
histogram intersection kernel and 101 classes, each
representing one action. For testing, a similar histogram
representation for the query video was computed and
classified using the trained SVM. This method yielded an
overall accuracy of 44.5%. The confusion matrix for all 101
actions is shown in Fig. 5.
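
The two core operations of this pipeline, assigning descriptors to their nearest codewords and comparing histograms with the intersection kernel, can be illustrated on toy data. This is a minimal sketch and not the authors' implementation: the function names are hypothetical, the codebook has 3 two-dimensional words instead of 4000 words of 162 dimensions, and normalising each histogram to sum to 1 is an assumption, since the text does not state a normalisation.

```python
from collections import Counter
from typing import List, Sequence

Vector = Sequence[float]


def nearest_word(descriptor: Vector, codebook: Sequence[Vector]) -> int:
    """Index of the codeword closest to the descriptor (squared Euclidean)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((d - c) ** 2 for d, c in zip(descriptor, codebook[k])),
    )


def bow_histogram(descriptors: Sequence[Vector], codebook: Sequence[Vector]) -> List[float]:
    """Histogram of codeword assignments, normalised to sum to 1 (assumption)."""
    counts = Counter(nearest_word(d, codebook) for d in descriptors)
    total = max(len(descriptors), 1)
    return [counts[k] / total for k in range(len(codebook))]


def histogram_intersection(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Histogram intersection kernel: sum over bins of min(h1[k], h2[k])."""
    return sum(min(a, b) for a, b in zip(h1, h2))


# Toy codebook and two toy "clips", each a small set of descriptors.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
clip_a = [(0.1, 0.0), (0.9, 0.1), (0.8, -0.1), (0.0, 1.2)]
clip_b = [(0.0, 0.9), (0.1, 1.1), (1.1, 0.0), (-0.1, 0.1)]

hist_a = bow_histogram(clip_a, codebook)
hist_b = bow_histogram(clip_b, codebook)
print(hist_a, hist_b)                          # [0.25, 0.5, 0.25] [0.25, 0.25, 0.5]
print(histogram_intersection(hist_a, hist_b))  # 0.75
```

Evaluating the kernel for every pair of training clips gives a Gram matrix, which could then be passed to any SVM solver that accepts a precomputed kernel.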

The accuracies for the predefined action types are:
Sports (50.54%), Playing Musical Instrument (37.42%),
Human-Object Interaction (38.52%), Body-Motion Only
(36.26%), Human-Human Interaction (44.14%). Sports
actions achieve the highest accuracy since performing sports
typically requires distinctive motions, which makes the
classification easier. Moreover, the background in sports
clips is generally less cluttered compared to other action
types. Unlike Sports actions, Human-Object Interaction clips
typically have a highly cluttered background. Additionally,
the informative motions typically occupy a small portion of
the motions in the clips, which explains the low recognition
accuracy of this action type.
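
Per-type figures of this kind can be derived from a confusion matrix once the classes are ordered by type. The sketch below uses the class index ranges given in the caption of Fig. 5. It is an illustration only: the text does not state how the per-type accuracies were aggregated, so the clip-weighted definition used here is an assumption, and `per_type_accuracy` is a hypothetical name.

```python
from typing import Dict, List, Tuple

# 1-based, inclusive class index ranges as listed in the Fig. 5 caption.
TYPE_RANGES: Dict[str, Tuple[int, int]] = {
    "Sports": (1, 50),
    "Playing Musical Instrument": (51, 60),
    "Human-Object Interaction": (61, 80),
    "Body-Motion Only": (81, 96),
    "Human-Human Interaction": (97, 101),
}


def per_type_accuracy(confusion: List[List[int]]) -> Dict[str, float]:
    """confusion[i][j] = number of clips of class i predicted as class j.

    Assumption: the accuracy of a type is the fraction of its clips whose
    exact class is predicted correctly (clip-weighted, not a class mean).
    """
    result = {}
    for name, (first, last) in TYPE_RANGES.items():
        rows = range(first - 1, last)  # convert to 0-based row indices
        correct = sum(confusion[i][i] for i in rows)
        total = sum(sum(confusion[i]) for i in rows)
        result[name] = correct / total if total else float("nan")
    return result


# Toy matrix: every class has 6 correct clips and 4 confused with the next class.
n = 101
toy = [[0] * n for _ in range(n)]
for i in range(n):
    toy[i][i] = 6
    toy[i][(i + 1) % n] = 4

for name, acc in per_type_accuracy(toy).items():
    print(f"{name}: {acc:.2%}")
```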



Dataset          Number of Actions  Clips  Background  Camera Motion  Release Year  Resource
KTH [ ]          6                  600    Static      Slight         2004          Actor Staged
Weizmann [ ]     9                  81     Static      No             2005          Actor Staged
UCF Sports [10]  9                  182    Dynamic     Yes            2009          TV, Movies
IXMAS [ ]        11                 165    Static      No             2006          Actor Staged
UCF11 [ ]        11                 1168   Dynamic     Yes            2009          YouTube
HOHA [ ]         12                 2517   Dynamic     Yes            2009          Movies
Olympic [ ]      16                 800    Dynamic     Yes            2010          YouTube
UCF50 [ ]        50                 6681   Dynamic     Yes            2010          YouTube
HMDB51 [ ]       51                 6766   Dynamic     Yes            2011          Movies, YouTube, Web
UCF101           101                13320  Dynamic     Yes            2012          YouTube

Table 2. Summary of Major Action Recognition Datasets



We recommend a 25-fold cross validation experimental
setup using all the videos in the dataset, to keep the tests
reported on UCF101 consistent; the baseline results
provided in this section were computed using the same
scenario.
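
One way to read this setup is that fold g holds out group g of every action class and trains on the remaining 24 groups. The sketch below builds such folds from clip file names of the form v_X_gY_cZ.avi. It is an interpretation rather than the authors' evaluation script, and `group_of` and `leave_one_group_out` are hypothetical names.

```python
import re
from typing import Iterator, List, Tuple

# Extracts the group number Y from a v_X_gY_cZ.avi file name.
_GROUP_RE = re.compile(r"_g(\d+)_c\d+\.avi$")


def group_of(filename: str) -> int:
    match = _GROUP_RE.search(filename)
    if match is None:
        raise ValueError(f"no group number in {filename!r}")
    return int(match.group(1))


def leave_one_group_out(
    filenames: List[str],
) -> Iterator[Tuple[int, List[str], List[str]]]:
    """Yield (held_out_group, train_files, test_files), one fold per group."""
    for held_out in sorted({group_of(f) for f in filenames}):
        test = [f for f in filenames if group_of(f) == held_out]
        train = [f for f in filenames if group_of(f) != held_out]
        yield held_out, train, test


# Toy listing: 2 actions x 3 groups x 2 clips (the real dataset has 25 groups).
files = [
    f"v_{action}_g{g:02d}_c{c:02d}.avi"
    for action in ("Archery", "Typing")
    for g in (1, 2, 3)
    for c in (1, 2)
]
for held_out, train, test in leave_one_group_out(files):
    print(held_out, len(train), len(test))
```

Because clips in one group share background or actors, splitting by group rather than by clip keeps near-duplicate footage out of the training set of the fold that tests on it.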

4. Related Datasets 

UCF Sports, UCF11, UCF50 and UCF101 are the four
action datasets compiled by UCF, in chronological order;
each one includes its precursor. We made two minor
modifications to the portion of UCF101 which includes the
UCF50 videos: the number of groups is fixed to 25 for all
the actions, and each group includes up to 7 clips. Table 2
shows a list of existing action recognition datasets with
detailed characteristics of each. Note that UCF101 is
remarkably larger than the rest.

5. Conclusion 

We introduced UCF101, which is the most challenging
dataset for action recognition compared to the existing
ones. It includes 101 action classes and over 13k clips,
which makes it considerably larger than other datasets.
UCF101 is composed of unconstrained videos downloaded
from YouTube which feature challenges such as poor
lighting, cluttered background and severe camera motion. We
provided baseline action recognition results on this new
dataset using a standard bag-of-words method, with an
overall accuracy of 44.5%.

Figure 5. Confusion table of baseline action recognition results using bag of words approach on UCF101. The drawn lines separate different
types of actions; 1-50: Sports, 51-60: Playing Musical Instrument, 61-80: Human-Object Interaction, 81-96: Body-Motion Only, 97-101:
Human-Human Interaction.



